Leslie REGAD
2019-02-21
layout(matrix(c(1,2), 1, 2, byrow = TRUE))
## show the regions that have been allocated to each plot
layout.show(2)
str(iris)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
Goal: predict the iris species from the flower's descriptors.
ind.app <- sample(1:nrow(iris), size = nrow(iris)*2/3)
mat.app <- iris[ind.app,]
summary(mat.app)
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species
 Min.   :4.30   Min.   :2.000   Min.   :1.100   Min.   :0.100   setosa    :32
 1st Qu.:5.10   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:35
 Median :5.75   Median :3.000   Median :4.250   Median :1.300   virginica :33
 Mean   :5.83   Mean   :3.053   Mean   :3.761   Mean   :1.196
 3rd Qu.:6.40   3rd Qu.:3.300   3rd Qu.:5.025   3rd Qu.:1.800
 Max.   :7.70   Max.   :4.400   Max.   :6.700   Max.   :2.500
mat.test <- iris[-ind.app,]
summary(mat.test)
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species
 Min.   :4.400   Min.   :2.200   Min.   :1.000   Min.   :0.100   setosa    :18
 1st Qu.:5.125   1st Qu.:2.800   1st Qu.:1.525   1st Qu.:0.250   versicolor:15
 Median :5.800   Median :3.000   Median :4.400   Median :1.300   virginica :17
 Mean   :5.870   Mean   :3.066   Mean   :3.752   Mean   :1.206
 3rd Qu.:6.500   3rd Qu.:3.375   3rd Qu.:5.100   3rd Qu.:1.900
 Max.   :7.900   Max.   :3.900   Max.   :6.900   Max.   :2.500
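The split above is drawn at random, so the class counts change on every run. A minimal sketch of a reproducible version follows. The seed value and the `.demo` names are illustrative assumptions, not the ones used to produce the output shown here.

```r
# Fix the random number generator so the same split is drawn on every run.
# The seed value 42 is an arbitrary, illustrative choice.
set.seed(42)

n <- nrow(iris)
n.app <- floor(n * 2 / 3)                 # training set size: 100 of the 150 rows
ind.app.demo <- sample(seq_len(n), size = n.app)

mat.app.demo  <- iris[ind.app.demo, ]     # training set
mat.test.demo <- iris[-ind.app.demo, ]    # test set: all remaining rows

# The two sets are disjoint and together cover the whole data set.
table(mat.app.demo$Species)
table(mat.test.demo$Species)
```

Because the draw is not stratified, the three species are only approximately balanced in each set, as in the summaries above.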
vcol.set <- rep("green", times = nrow(iris))
vcol.set[ind.app] <- "red"
par(mfrow=c(1,2))
library(FactoMineR)  # provides PCA()
pca.res <- PCA(iris[,-5], graph = FALSE)
plot(pca.res, choix="ind", col.ind = vcol.set, label="none")
legend("topleft", c("training set", "test set"), col = c("red", "green"), lty=1)
barplot(cbind(table(mat.app[,"Species"]), table(mat.test[,"Species"])), beside=TRUE, names.arg = c("Training", "Test"), col=c(4,5,6))
# levels() follows the row order of table(), so the legend colours match the bars
legend("topright", legend=levels(mat.app[,"Species"]), col=c(4,5,6), pch=15)
library(rpart)       # provides rpart()
library(rpart.plot)  # provides prp()
rpart.fit <- rpart(Species ~ . , data = mat.app)
prp(rpart.fit, type=0, extra=1, nn=TRUE)
predict.app <- predict(rpart.fit, newdata = mat.app, type="class")
tc <- table(mat.app[,"Species"], predict.app)
tc
            predict.app
             setosa versicolor virginica
  setosa         32          0         0
  versicolor      0         32         3
  virginica       0          1        32
TBP <- sum(diag(tc))/sum(tc)
TBP
[1] 0.96
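The rate of correct predictions is the diagonal of the confusion matrix divided by its total, here (32 + 32 + 32) / 100 = 0.96. A short sketch that rebuilds the training-set matrix from the counts printed above and adds the per-class rate; the names `tc.demo`, `acc` and `acc.class` are introduced for illustration only.

```r
# Confusion matrix copied from the training-set output
# (rows: true species, columns: predicted species).
species <- c("setosa", "versicolor", "virginica")
tc.demo <- matrix(c(32,  0,  0,
                     0, 32,  3,
                     0,  1, 32),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(truth = species, predicted = species))

# Overall rate of correct predictions: diagonal over total.
acc <- sum(diag(tc.demo)) / sum(tc.demo)

# Per-class rate: correctly predicted individuals over the true class size.
acc.class <- diag(tc.demo) / rowSums(tc.demo)

acc
round(acc.class, 3)
```

The per-class view shows that the errors come only from versicolor and virginica; setosa is always recognised.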
predict.test <- predict(rpart.fit, newdata = mat.test, type="class")
tc <- table(mat.test[,"Species"], predict.test)
tc
            predict.test
             setosa versicolor virginica
  setosa         18          0         0
  versicolor      0         12         3
  virginica       0          0        17
TBP <- sum(diag(tc))/sum(tc)
TBP
[1] 0.94
Petal.Length and Petal.Width are the variables needed to predict the species.
library(corrplot)  # provides corrplot()
nf <- layout(matrix(c(1,2), 1, 2))
prp(rpart.fit, type=0, extra=1, nn=TRUE)
corrplot(cor(iris[,-5]), method="circle", tl.cex=0.8, cl.cex=0.8)
\(\rightarrow\) the tree gives no information about the other descriptors
dim(mat.app)
[1] 100 5
summary(mat.app)
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species
 Min.   :4.30   Min.   :2.000   Min.   :1.100   Min.   :0.100   setosa    :32
 1st Qu.:5.10   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:35
 Median :5.75   Median :3.000   Median :4.250   Median :1.300   virginica :33
 Mean   :5.83   Mean   :3.053   Mean   :3.761   Mean   :1.196
 3rd Qu.:6.40   3rd Qu.:3.300   3rd Qu.:5.025   3rd Qu.:1.800
 Max.   :7.70   Max.   :4.400   Max.   :6.700   Max.   :2.500
ech.Boot1 <- sample(1:nrow(mat.app), size = nrow(mat.app), replace=TRUE)
sort(ech.Boot1)
 [1] 1 5 5 6 6 8 9 10 10 12 12 12 13 14 15 16 17 23 24 24 25 25 26 27 27 32 33 34 36 39 42 43 43 43 44 44 45 45 47 48 48 48 50 50 50 50 51 51 52 53 54 57 57 60 61 62 62 62 63 65 65 65 66 68 68 68 69 70 70 71 71 72 72 75 75 77 78 80 82 82 83 83 85 86 86 87 87 88 88 88 89 91 93 93 96 97 97 97
[99] 99 99
ech.OOB1 <- setdiff(1:nrow(mat.app), ech.Boot1)
ech.OOB1
 [1] 2 3 4 7 11 18 19 20 21 22 28 29 30 31 35 37 38 40 41 46 49 55 56 58 59 64 67 73 74 76 79 81 84 90 92 94 95 98 100
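Here 39 of the 100 training rows end up out-of-bag. This is close to what is expected: a given row is missed by all n draws with probability (1 - 1/n)^n, which is about 0.366 for n = 100. A small sketch that computes this value and checks it by simulation; the seed and the number of repetitions are arbitrary choices.

```r
n <- 100                      # size of the training set
p.oob <- (1 - 1 / n)^n        # probability that a given row is never drawn in n draws

# Empirical check on simulated bootstrap samples
# (seed and repetition count are arbitrary, illustrative values).
set.seed(1)
frac.oob <- replicate(2000, {
  boot <- sample(seq_len(n), size = n, replace = TRUE)
  length(setdiff(seq_len(n), boot)) / n
})

round(c(theory = p.oob, simulated = mean(frac.oob)), 3)
```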
nbr.var <- 2
list.var <- colnames(iris)[-5]
var.sel <- sample(list.var, size = nbr.var)
var.sel
[1] "Sepal.Width" "Petal.Length"
random.data1 <- mat.app[ech.Boot1, var.sel]
head(random.data1)
    Sepal.Width Petal.Length
51          3.2          4.7
142         3.1          5.1
133         2.8          5.6
59          2.9          4.6
15          4.0          1.2
113         3.0          5.5
From these data, a node of the tree is learned.
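The steps above (bootstrap sample, out-of-bag rows, random choice of descriptors) can be chained into a small sketch that grows a single tree and scores it on its out-of-bag rows. This is a simplification under stated assumptions: the descriptors are drawn once for the whole tree, whereas a random forest redraws them at every node. The training set is a seeded stand-in, and the names `train`, `boot`, `oob`, `vars`, `tree` and `err.oob` are illustrative.

```r
library(rpart)

set.seed(1)  # arbitrary seed, for a repeatable illustration

# Stand-in for the training set (the document's mat.app comes from an unseeded draw).
train <- iris[sample(seq_len(nrow(iris)), size = 100), ]

# 1. Bootstrap sample and its out-of-bag complement.
boot <- sample(seq_len(nrow(train)), size = nrow(train), replace = TRUE)
oob  <- setdiff(seq_len(nrow(train)), boot)

# 2. Random subset of two descriptors (drawn once here, not at every node).
vars <- sample(colnames(iris)[-5], size = 2)

# 3. Fit a tree on the bootstrap sample, restricted to the selected descriptors.
tree <- rpart(reformulate(vars, response = "Species"), data = train[boot, ])

# 4. Estimate the error of this tree on the rows it never saw.
pred <- predict(tree, newdata = train[oob, ], type = "class")
err.oob <- mean(as.character(pred) != as.character(train[oob, "Species"]))
err.oob
```

Repeating these steps many times and taking a majority vote over the trees gives the idea behind the forest fitted next.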
library(randomForest)
# the number of trees is set with the ntree argument
rf.fit <- randomForest(Species~., data=mat.app, ntree = 500, mtry = 2, importance=TRUE)
rf.fit
Call:
 randomForest(formula = Species ~ ., data = mat.app, ntree = 500, mtry = 2, importance = TRUE)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 2
OOB estimate of error rate: 5%
Confusion matrix:
           setosa versicolor virginica class.error
setosa         32          0         0  0.00000000
versicolor      0         33         2  0.05714286
virginica       0          3        30  0.09090909
head(rf.fit$err.rate)
            OOB setosa versicolor  virginica
[1,] 0.00000000      0 0.00000000 0.00000000
[2,] 0.03333333      0 0.05263158 0.05263158
[3,] 0.03947368      0 0.04166667 0.08000000
[4,] 0.03658537      0 0.08000000 0.03703704
[5,] 0.05555556      0 0.06666667 0.10000000
[6,] 0.08510638      0 0.09677419 0.16129032
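Row i of `err.rate` holds the out-of-bag error of the forest restricted to its first i trees, overall and per class. A minimal sketch of how this curve can be read; it is fitted on the full `iris` data with an arbitrary seed, so its values differ from the output above, and `rf.demo` and `oob.curve` are illustrative names.

```r
library(randomForest)

set.seed(1)  # arbitrary seed, for a repeatable illustration
rf.demo <- randomForest(Species ~ ., data = iris, ntree = 500, mtry = 2)

# OOB error as a function of the number of trees (one value per tree added).
oob.curve <- rf.demo$err.rate[, "OOB"]

oob.curve[length(oob.curve)]   # OOB error of the full forest
which.min(oob.curve)           # smallest number of trees reaching the lowest OOB error
```

The first rows are noisy because few trees vote for each row; the curve is mainly useful to check that the error has stabilised at the chosen number of trees.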
plot(rf.fit, main="Model performance")
varImpPlot(rf.fit)
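The values drawn by `varImpPlot()` can also be read as a table with `importance()`. A short sketch, again fitted on the full `iris` data with an arbitrary seed (`rf.demo` and `imp` are illustrative names), which ranks the four descriptors. Unlike the single tree, the forest gives a score to every descriptor.

```r
library(randomForest)

set.seed(1)  # arbitrary seed, for a repeatable illustration
rf.demo <- randomForest(Species ~ ., data = iris, ntree = 500, mtry = 2,
                        importance = TRUE)

# One row per descriptor; with importance = TRUE the matrix contains the
# per-class columns plus MeanDecreaseAccuracy and MeanDecreaseGini.
imp <- importance(rf.demo)

# Descriptors ranked by mean decrease in Gini impurity.
imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE),
    c("MeanDecreaseAccuracy", "MeanDecreaseGini")]
```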